ABSTRACT

Objectives The clinical course of multiple sclerosis 
(MS) is highly variable, and research data collection is 
costly and time consuming. We evaluated natural 
language processing techniques applied to electronic 
medical records (EMR) to identify MS patients and the 
key clinical traits of their disease course. 
Materials and methods We used four algorithms 
based on ICD-9 codes, text keywords, and medications 
to identify individuals with MS from a de-identified, 
research version of the EMR at Vanderbilt University. 
Using a training dataset of the records of 899 
individuals, algorithms were constructed to identify and 
extract detailed information regarding the clinical course 
of MS from the text of the medical records, including 
clinical subtype, presence of oligoclonal bands, year of 
diagnosis, year and origin of first symptom, Expanded 
Disability Status Scale (EDSS) scores, timed 25-foot walk 
scores, and MS medications. Algorithms were evaluated 
on a test set validated by two independent reviewers. 
Results We identified 5789 individuals with MS. For all 
clinical traits extracted, precision was at least 87% and 
specificity was greater than 80%. Recall values for 
clinical subtype, EDSS scores, and timed 25-foot walk 
scores were greater than 80%. 
Discussion and conclusion This collection of clinical 
data represents one of the largest databases of detailed, 
clinical traits available for research on MS. This work 
demonstrates that detailed clinical information is 
recorded in the EMR and can be extracted for research 
purposes with high reliability. 



INTRODUCTION 

Patients with multiple sclerosis (MS) have a highly 
variable and poorly understood disease course, 
which varies from relatively minor intermittent and 
resolving neurological deficits to rapid, progressing, 
and permanent neurological deficits. Most research 
studies have focused on the origin of the disease, 
partly because of the difficulty in ascertaining suffi- 
cient longitudinal clinical data to study the disease 
course. Electronic medical records (EMR) may 
provide such a tool. We have previously shown that 
genomic signals of MS risk can be replicated using 
EMR-derived cohorts. 1 2 In this paper, we evalu- 
ated algorithms to identify patients with MS from 
the EMR and created new algorithms to extract 
detailed clinical information for the disease course 
of MS. 

BACKGROUND AND SIGNIFICANCE 

MS is a common, complex autoimmune disease 
with profound impact on the lives of individuals it 
touches. Despite rigorous study, much remains
unknown about its pathophysiology and origins.
Many genetic and environmental factors have been 
linked to the development of MS in an individual. 
In the past decade, scores of genetic variants have 
been associated with MS and replicated in subsequent
studies. 3-5 Smoking, increased distance from the 
equator, and exposure to Epstein-Barr virus have 
been identified as risk factors. 6 7 While we do not 
fully understand how or why the disease develops, 
we know even less about the actual disease course, 
which is highly variable. Clinical data in EMR may 
be a rich resource of information that would allow 
greater research into the disease course of MS. 

Clinical expression of the disease, including age of 
onset, rate of progression, and type and frequency of 
symptoms, varies drastically between individuals. 8-10 
While genetic susceptibility to MS has been widely 
studied, there has been much less focus on its varied 
clinical expression. This is largely due to the difficul- 
ties and expense of collecting detailed longitudinal 
data on the large number of individuals often 
required for studies of complex diseases. However, 
these data are frequently recorded in physician notes 
typed or dictated into the EMR. While data recorded
in medical records are less standardized than data collected
expressly for research purposes, they are a rich
resource that could be leveraged for complex diseases,
such as MS.

Extracting data manually from medical records is 
tedious, time-consuming work that is prone to 
human error. The advent of EMR provides an 
opportunity to drastically shorten the time required 
to extract relevant medical information and 
decrease human error. Despite this promise, 
extracting information from EMR can be challen- 
ging. Typically, multimodal algorithms must be 
created, incorporating EMR components such as 
billing codes, medication data, laboratory values, 
and natural language processing (NLP) to achieve 
high positive predictive values (PPV) to identify 
disease states. 1 11 12 Identification of more detailed 
phenotypes, such as envisioned in 'next-generation 
phenotyping' 13 and drug response phenotypes, is 
more challenging and has only recently been 
explored. 14 15 We conducted a study of MS in a 
de-identified, research version of the EMR at 
Vanderbilt University Medical Center (VUMC) to 
determine the depth and range of clinical informa- 
tion relating to the disease course of MS in 5789 
individuals. 

MATERIALS AND METHODS 
Study population 

All medical records were obtained from the VUMC 
Synthetic Derivative, a research resource of over
two million de-identified records, including inpatients and
outpatients. 1 16 Identifying information is removed from each
record, including names, places, and identifying numbers, and 
the dates in each person's record are shifted consistently within 
a 365-day window in the past. The EMR at VUMC saw broad
use as early as the 1990s, although not all clinical specialties 
adopted its use simultaneously. Relevant to this study, the MS 
Center at VUMC was established in 1994 and serves as both a 
primary and tertiary center for the evaluation and treatment of 
MS. The MS Center transitioned to computer-based documen- 
tation in 1997. 

We utilized four previously published algorithms to identify MS 
patients from this database; 1 the algorithms focus on International 
Classification of Diseases, revision 9 (ICD-9) billing codes, pre- 
scribed MS treatments, and keywords located in the text. We made 
minor modifications: we increased the number of ICD-9 codes for MS
required in the 'definitive type 1' algorithm to two or more instances,
and added the ICD-9 code for acute transverse myelitis (341.2) to the
'definitive type 2' and 'possible type 1' algorithms. These updated
algorithms are publicly available on PheKB
(http://www.phekb.org/phenotype/multiple-sclerosis-demonstration-project).

Algorithms to extract detailed clinical traits 

Algorithms to extract clinical data from EMR text were imple- 
mented using Perl to access and search records stored in a 
MySQL database. Algorithms were initially developed using 899 
records as a training dataset and then evaluated using a test set 
of 4890 records. Before algorithm development, we examined 
60 training set records to determine what types of detailed clin- 
ical information related to the MS disease type and its course 
were often available, and how they were expressed in the clin- 
ical notes. We identified eight attributes: clinical subtype; pres- 
ence of oligoclonal bands; year of diagnosis; Expanded 
Disability Status Scale (EDSS) score; timed 25-foot walk; year
and origin of first neurological symptom; and MS medications. 
Our goal was to extract data explicitly stated in the medical 
record; we did not infer information (eg, the clinical subtype) 
from descriptions in the text. 

Clinical subtype 

The four clinical subtypes of MS are: relapsing remitting, sec- 
ondary progressive, primary progressive and relapsing progres- 
sive. Subtypes were extracted from clinic notes, letters and 
problem lists (PL) that mentioned MS. Subtypes preceded or 
followed by words suggesting the clinician was not certain, such 
as 'questionable' or 'possible', were excluded by the use of 
regular expressions. As an individual may be classified with dif- 
ferent subtypes over the course of their illness, all distinct sub- 
types mentioned for each individual were kept. 
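The subtype rule described above can be sketched as follows. This is a minimal illustration in Python, not the Perl scripts used in the study: the spelling variants, the 20-character look-around, and every uncertainty cue other than 'questionable' and 'possible' are assumptions.

```python
import re

# Subtype names come from the text; the spelling variants are assumptions.
SUBTYPES = {
    "relapsing remitting": r"relapsing[\s-]*remitting",
    "secondary progressive": r"secondary[\s-]*progressive",
    "primary progressive": r"primary[\s-]*progressive",
    "relapsing progressive": r"relapsing[\s-]*progressive",
}
# 'questionable' and 'possible' are named in the text; the rest are assumed.
UNCERTAIN = r"(?:questionable|possible|possibly|probable|suspected)"


def extract_subtypes(note: str) -> set[str]:
    """Return every distinct subtype stated without an adjacent uncertainty cue."""
    found = set()
    for name, pattern in SUBTYPES.items():
        for m in re.finditer(pattern, note, flags=re.IGNORECASE):
            before = note[max(0, m.start() - 20):m.start()]
            after = note[m.end():m.end() + 20]
            # Skip mentions directly preceded or followed by an uncertainty cue.
            if re.search(UNCERTAIN + r"\W*$", before, re.IGNORECASE):
                continue
            if re.search(r"^\W*" + UNCERTAIN, after, re.IGNORECASE):
                continue
            found.add(name)
    return found
```

Returning a set mirrors the decision to keep all distinct subtypes recorded for an individual over time.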

Oligoclonal bands 

Over 85% of patients with MS have antibodies present in the 
cerebrospinal fluid and not in serum. These are referred to as 
oligoclonal bands and identifying these bands can aid clinicians 
in the diagnosis of MS. 17 As such testing is often performed by 
referring providers (and not repeated at referral centers, such as 
VUMC), it is important to search the clinical documentation in 
addition to laboratory results. We identified clinic notes, letters, 
and PL mentioning oligoclonal bands and extracted 200 charac- 
ters surrounding the word 'oligoclonal'. The result was recorded 
as positive (ie, the clinician stated the test was positive or two or 
more bands were present) or negative (ie, the clinician stated the 
result was negative or no bands were observed) using regular
expressions. No result was reported if one band was observed
(inconclusive result). In the event that a person had both a nega- 
tive and a positive result reported, the algorithm ignored the 
data and no conclusive result was recorded. 
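A minimal Python sketch of this window-and-classify logic is given below. The cue patterns are hypothetical, since the regular expressions used in the study are not listed here, and the 200-character window is assumed to mean 100 characters on each side of the keyword.

```python
import re

WINDOW = 100  # assumed: 100 characters each side of 'oligoclonal'

# Hypothetical cue patterns: a stated positive result or two or more bands
# counts as positive; a stated negative result or no bands counts as negative.
NUM = r"(?:two|three|four|five|six|seven|eight|nine|ten|[2-9]|\d{2})"
POSITIVE = re.compile(rf"\bpositive\b|\b{NUM}\b[\w\s]{{0,20}}\bbands\b", re.I)
NEGATIVE = re.compile(r"\bnegative\b|\b(?:no|0|zero)\b[\w\s]{0,20}\bbands\b", re.I)


def classify_window(window: str):
    """Label one text window; ambiguous or single-band text yields None."""
    pos = bool(POSITIVE.search(window))
    neg = bool(NEGATIVE.search(window))
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None


def oligoclonal_status(notes: list[str]):
    """Combine all mentions for one person; conflicting results yield None."""
    results = set()
    for note in notes:
        for m in re.finditer(r"oligoclonal", note, re.I):
            window = note[max(0, m.start() - WINDOW):m.end() + WINDOW]
            label = classify_window(window)
            if label:
                results.add(label)
    return results.pop() if len(results) == 1 else None
```

This sketch does not handle negation such as 'not positive'; a production version would need that.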

Year of diagnosis 

MS is a clinically defined disease and the diagnostic criteria have 
evolved over the past 30 years. 18-20 Hence, the diagnosis of MS 
made by the clinician on a particular patient was based on the set 
of criteria that were relevant and operative at the time of the diag- 
nosis. We extracted the year of diagnosis as recorded by the clin- 
ician, regardless of the definition used. Clinic notes and letters in 
the EMR were examined to identify mentions of the words 'diag- 
nosis' and 'MS'. We identified exact, for example, '1975', and rela- 
tive, for example, '3 years ago', dates that occurred within 70 
characters of 'diagnosis'. 

To determine the most likely diagnosis year, we first looked at 
exact references and recorded the most frequent year as the 
diagnosis year in our database. If no year of diagnosis was 
recorded in an exact reference, we analyzed relative references 
in the same manner. Identifying the most frequently reported 
year removed many typographical errors that were initially 
observed. 
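The two-stage vote described above (exact years first, relative references only as a fallback) might look like the following Python sketch. The `note_year` argument is a hypothetical stand-in for the date attached to each note, which is needed to resolve phrases such as '3 years ago'; the year and relative-date patterns are also assumptions.

```python
import re
from collections import Counter

WINDOW = 70  # characters on either side of 'diagnosis'

EXACT = re.compile(r"\b(19\d{2}|20\d{2})\b")                 # assumed year format
RELATIVE = re.compile(r"\b(\d{1,2})\s+years?\s+ago\b", re.I)  # assumed phrasing


def diagnosis_year(notes):
    """notes: iterable of (note_year, text) pairs for one individual."""
    exact, relative = Counter(), Counter()
    for note_year, text in notes:
        if not re.search(r"\bMS\b", text):
            continue  # the note must mention MS as well as 'diagnosis'
        for m in re.finditer(r"diagnosis", text, re.I):
            window = text[max(0, m.start() - WINDOW):m.end() + WINDOW]
            exact.update(int(y) for y in EXACT.findall(window))
            relative.update(note_year - int(n) for n in RELATIVE.findall(window))
    # Prefer the most frequent exact year; fall back to relative references.
    for counts in (exact, relative):
        if counts:
            return counts.most_common(1)[0][0]
    return None
```

Taking the most frequent year is what lets a single mistyped year be outvoted by correct mentions elsewhere in the record.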

Measures of progression of disease disability 

The EDSS 21 and timed 25-foot walk 22 are two measures used to
monitor the progression of MS disability. Both can be recorded 
in structured fields in a manner similar to laboratory values. At 
VUMC, EDSS does not have a structured field but is often men- 
tioned in clinic notes. The MS Center created a structured field 
for the timed 25-foot walk in 2008; however, scores have been 
collected and recorded in the text since 1999. We created algo- 
rithms to extract both of these measures from the narrative text 
in the absence of structured fields. Additional discussion of 
these measures and comparison of timed walk scores extracted 
from the clinical text and structured fields are included in the 
supplementary data (available online only). 

The EDSS has a range from 0 (no disability due to MS) to 10 
(death due to MS), in increments of 0.5. 21 The algorithm to 
extract these values from the text searched for 'EDSS' in notes, 
PL, and communications. Values (0-10) reported within 50 
characters after 'EDSS' were extracted, and the closest number 
within range was recorded as EDSS scores. 
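As an illustration, the EDSS rule can be written as a short Python sketch. Here 'closest number within range' is interpreted as the first number after the keyword that is a valid EDSS value (0 to 10 in steps of 0.5); that interpretation and the number pattern are assumptions.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def is_valid_edss(value: float) -> bool:
    """EDSS runs from 0 to 10 in increments of 0.5."""
    return 0 <= value <= 10 and (value * 2).is_integer()


def extract_edss(text: str) -> list[float]:
    """Return one score per 'EDSS' mention, taken from the 50 characters after it."""
    scores = []
    for m in re.finditer(r"EDSS", text, re.I):
        window = text[m.end():m.end() + 50]
        for num in NUMBER.finditer(window):
            value = float(num.group())
            if is_valid_edss(value):
                scores.append(value)
                break  # keep only the nearest valid number
    return scores
```

Validating against the scale filters out years and other out-of-range numbers that happen to follow the keyword.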

To capitalize on the longitudinal aspect of timed 25-foot
walks before structured values were available in 2008, we 
selected notes, then lines of text, from the clinical notes that 
mentioned 'timed walk', '25 feet', or '25 foot'. Times were 
extracted and recorded in seconds. The final output of this algo- 
rithm also noted if a walking aid (eg, cane) was mentioned. 
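A line-based extraction of this kind could be sketched in Python as follows. The unit spellings and the list of walking aids other than 'cane' are assumptions; whitespace is optional in the cue patterns so that text with dropped spaces still matches.

```python
import re

# Cue phrases from the text, with optional whitespace.
LINE_CUE = re.compile(r"timed\s*walk|25\s*-?\s*(?:feet|foot)", re.I)
# Assumed unit spellings for a time in seconds.
SECONDS = re.compile(r"(\d+(?:\.\d+)?)\s*(?:seconds?|secs?|s)\b", re.I)
# 'cane' is given in the text; the other aids are assumptions.
AID = re.compile(r"\b(cane|walker|crutch(?:es)?|rollator)\b", re.I)


def extract_timed_walks(note: str) -> list[dict]:
    """Return one record per time found on a line that mentions the timed walk."""
    results = []
    for line in note.splitlines():
        if not LINE_CUE.search(line):
            continue
        aid = AID.search(line)
        for m in SECONDS.finditer(line):
            results.append({
                "seconds": float(m.group(1)),
                "aid": aid.group(1).lower() if aid else None,
            })
    return results
```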

Year and origin of first neurological symptom 

As the clinical diagnosis of MS requires the presence of two 
lesions disseminated in space and time, patients are rarely diag- 
nosed at the first presentation of neurological symptoms. 
However, the initial presentation of neurological symptoms of 
the disease may be important for research purposes and appears 
to aggregate in families (both the age and type of first neuro- 
logical symptom). 23 While there are many references to symp- 
toms in the narrative text, a complete neurological history must 
be investigated to be confident of identifying the first neuro- 
logical symptom. We noticed that such a history was often 
reported in letters written from physicians at the MS Center to 
referring physicians and we restricted our algorithms to search 
these letters. The algorithm to identify the year of initial 
neurological symptom selected 100 characters around phrases
referencing the beginning of the disease course, that is, 'dating 
back' and 'began'. Specific dates were extracted from these 
phrases, either exact or relative. 

To identify the type of first neurological symptom, 250 char- 
acters surrounding phrases that referenced the beginning of the 
disease course were extracted and run through the 
KnowledgeMap concept identifier, 24 25 which is a general 
purpose NLP system supporting negation and word-sense dis- 
ambiguation, similar to MetaMap. 26 Concept unique identifiers 
(CUI) representing neurological symptoms were selected as the 
output of interest, as identified using Unified Medical Language 
System semantic types (see supplementary data, available online 
only). We then used text keywords and CUI to group the symp- 
toms into central nervous system site of origin (brain stem, optic 
nerve, or spinal cord) using a list of MS-related neurological 
symptoms we compiled. Symptoms that did not fall into one of 
these categories were marked as 'other'. If more than one origin 
was identified, all were recorded and the origin was marked 
'polysymptomatic'. Figure 1 provides a schematic of this 
algorithm. 
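The grouping step can be illustrated with a simplified Python sketch. It swaps in plain substring matching for the KnowledgeMap concept identifier, so it has no negation handling or word-sense disambiguation. The keyword lists are hypothetical and not clinically validated, and the split of the 250-character window into 50 characters before and 200 after the phrase is an assumption based on the example in Figure 1.

```python
import re

# Phrases marking the start of the disease course (variants are assumptions).
START_PHRASES = re.compile(r"dat(?:e|es|ed|ing)\s+back|began", re.I)

# Hypothetical keyword lists standing in for the study's symptom list and CUI.
SITE_KEYWORDS = {
    "optic nerve": ["optic neuritis", "blurred vision", "eye pain"],
    "brain stem": ["diplopia", "double vision", "vertigo"],
    "spinal cord": ["lhermitte", "paraparesis", "bladder"],
}
OTHER_SYMPTOMS = ["fatigue", "headache"]


def first_symptom_origin(letter: str, before: int = 50, after: int = 200) -> dict:
    """Group symptoms near a start-of-course phrase by site of origin."""
    origins = set()
    for m in START_PHRASES.finditer(letter):
        window = letter[max(0, m.start() - before):m.end() + after].lower()
        for site, words in SITE_KEYWORDS.items():
            if any(w in window for w in words):
                origins.add(site)
        if any(w in window for w in OTHER_SYMPTOMS):
            origins.add("other")
    # All origins are kept; more than one marks the record polysymptomatic.
    return {"origins": sorted(origins), "polysymptomatic": len(origins) > 1}
```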

Medications 

Medications administered for the treatment of MS are fairly 
specific to this disease. MS medications are often discussed in a 
clinic visit with the patient and the patient is sent home with 
pamphlets to determine which medication they wish to start. 
Although VUMC has electronic prescribing tools, many out- 
patient prescriptions (especially in the early 2000s) are only 
documented in the free text of clinical notes, clinical messaging 
systems, or PL, and this has been especially true of the MS 
Center. Discussion of MS medications in narrative text could be 
because the patient is on the medication, the patient failed the 
medication due to continued progression of MS or excessive 
side effects, the clinician is considering the medication for the
patient in the future, or the patient came into the clinic with
questions regarding a specific treatment. To retrieve medications 
the patients were actually taking, we focused our efforts on 
extracting MS-related medications from PL only. The goal of 
this algorithm was to determine if a patient was ever on a medication.
Extracted medications include interferon β-1a, interferon
β-1b, glatiramer acetate, fingolimod, natalizumab, mitoxantrone,
and teriflunomide. Text matching, using brand and generic 
names, was done in PL text to create a list of medications the 
patient had taken. Electronic prescribing tools automatically 
update the PL, so this method should also capture electronic 
prescriptions with near-perfect fidelity. 
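A name-matching pass over problem list text can be sketched in Python as below. The generic names are those listed above; the brand names are a commonly used set supplied here as an assumption, since the study's own name list is not reproduced in this paper.

```python
import re

# Generic name -> search terms (generic plus assumed brand names).
MS_MEDICATIONS = {
    "interferon beta-1a": ["interferon beta-1a", "avonex", "rebif"],
    "interferon beta-1b": ["interferon beta-1b", "betaseron", "extavia"],
    "glatiramer acetate": ["glatiramer", "copaxone"],
    "fingolimod": ["fingolimod", "gilenya"],
    "natalizumab": ["natalizumab", "tysabri"],
    "mitoxantrone": ["mitoxantrone", "novantrone"],
    "teriflunomide": ["teriflunomide", "aubagio"],
}


def medications_ever_listed(problem_lists: list[str]) -> set[str]:
    """Return the generic names of MS medications found in any problem list."""
    found = set()
    for pl in problem_lists:
        text = pl.lower()
        for generic, names in MS_MEDICATIONS.items():
            if any(re.search(rf"\b{re.escape(n)}\b", text) for n in names):
                found.add(generic)
    return found
```

Because the goal is only whether a patient was ever on a medication, the result is a set with no dates attached.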

Evaluation 

We reviewed the Synthetic Derivative records for 367 indivi- 
duals across all case algorithms to create a gold standard for MS 
case status. Individuals were selected randomly within each MS 
algorithm type, and at least 50 individuals per case selection 
algorithm were reviewed. Each individual was categorized as 
diagnosed with MS, possible MS, or no MS, based on clinician 
impressions. 

One hundred records were selected randomly from the test 
set for a blinded evaluation of the clinical trait algorithms. 
These records were reviewed manually for all clinical character- 
istics extracted by algorithms to define a gold standard. The 
reviewer recorded the information the treating clinician(s) 
appeared the most confident in by the end of the record (clinical 
subtype, year of diagnosis). The first 20 records were reviewed 
independently by author SS, a board-certified neurologist and 
founder and chief of the MS Center, and a graduate student
(MFD), with any discrepancies adjudicated by a second board- 
certified internist (JCD), blinded to the source of discrepancy. 
Given high initial concordance (92-100% per trait, median 
99%) the graduate student performed the manual abstraction of 
the following 80 records. Kappa values were between 0.8 and
1.0 for each trait, with a median of 0.96. Manual abstraction
took an average of 12.6 min per individual record, with a range
of 1-40 min.

Figure 1 Schematic to represent how the algorithm to determine
the origin of first neurological symptom works.

Example de-identified letter:

Dear Dr. **NAME[WV]:

I recently had the opportunity to see your patient, Mrs. **NAME[BBB AAA], in
the Multiple Sclerosis Clinic at **INSTITUTION.

As you are aware, Mrs. **NAME[AAA] is a **AGE[in 20s]-year-old right-handed woman whose
initial neurological difficulties date back to **DATE[Dec 1998] when she had
left eye pain and blurred vision that persisted and resolved within two
weeks. She had an MRI scan at that time that was reportedly normal, and
she was told that this was due to a virus; however, she did receive the
diagnosis of optic neuritis. She did not receive any treatment at that
time.

In **DATE[Jul 1999], she had another episode quite similar to the first. It
lasted for approximately two weeks and resolved spontaneously.

Text around 'date back' extracted:

nded woman whose initial neurological difficulties date back to **DATE[Dec 1998] when she had left eye pain and
blurred vision that persisted and resolved within two weeks. She had an MRI scan at that time that was reportedly
normal, and she was told that this wa

Text processed in KnowledgeMap
(only concepts of interest shown)

Demographic and clinical trait extraction data from the subset 
of records reviewed manually were compared to the overall 
dataset and demonstrated that the subset is an accurate represen- 
tation of the overall dataset (see supplementary data, available 
online only). 

PPV for case algorithms were calculated twice, with and 
without possible cases included as true positives. Clinical trait 
data derived from manual abstraction were compared to data 
extracted via the algorithms designed in this study. For all traits, 
recall, precision, specificity, and F-measure were calculated. True 
positives were defined as algorithm-extracted clinical traits that 
matched those found by manual abstraction; more than one 
true positive per person was possible for the EDSS, clinical 
subtype, timed 25-foot walk, medications, and origin of first
symptom algorithms. We defined a person as a true negative 
when no values were extracted by the algorithm and when no 
values were also found by manual abstraction. There was thus a 
maximum of one true negative per trait per person. 
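For reference, the four statistics can be computed from the counts defined above as in this Python sketch. The F-measure is assumed to be the balanced F1 score (the harmonic mean of precision and recall), which is consistent with the values reported in table 4.

```python
def trait_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Recall, precision, specificity, and balanced F-measure for one trait."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# tp and fn follow table 4 for clinical subtype (60 of 61 found);
# fp and tn are hypothetical values used only to complete the example.
example = trait_metrics(tp=60, fp=8, fn=1, tn=34)
```

Note that true positives are counted per extracted value while true negatives are counted per person, so the four counts do not sum to the number of records reviewed.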

RESULTS 

Case selection accuracy 

A total of 5789 individuals were identified as cases by algorithms,
with 4060 (70%) individuals matching one of the 'definitive' criteria
(table 1). PPV ranged from 16% to 96%. Reported demographics
for all individuals are listed in table 2. Median follow-up
time by individual was 4.5 years (range 0-20 years). On review, 
there were more false positives in the 'possible type 1' category
than desired, including many individuals who were seen in the 
MS Center for other diseases, such as neurosarcoidosis. 

Clinical trait extraction 

Our algorithms extracted information for each clinical trait of 
interest in 903 (16%) to 3523 (61%) out of 5789 total MS indi- 
viduals (table 3). Specificities for all algorithms were high, with 
seven of eight algorithms achieving specificity greater than 90% 
(table 4). Precision ranged from 87% to 99%. For clinical 
subtype and timed 25-foot walk, recall was at least 90%.
However, recalls for year of diagnosis and origin of first 
symptom were 33% and 23%, respectively. The F-measure for 
all traits except year of diagnosis and origin of first symptom 
was above 70%. 

After comparison to the gold standard was complete, we
identified the need for minor changes in the algorithms for
timed 25-foot walk, year of first symptom, and origin of first
symptom, which significantly increased recall compared to the
original algorithms at a nominal p value of 0.05 (p = 0.02, 0.03,
0.02, respectively; table 5). During compilation into the database,
some spaces and new lines were removed. We allowed for
such changes by making spaces optional in regular expressions
for timed walks and year and origin of first symptom. In addition,
we identified another note title that represented letters to
referring physicians and included the year and origin of first
symptom. The F-measure for the algorithm of origin of first
symptom also increased significantly (p = 0.02).

Table 1 Counts of individuals selected by case algorithms

Algorithm            No of samples   PPV* (%)   PPV† (%)
Definitive type 1    3975            96         96
Definitive type 2    85              64         79
Possible type 1      1315            16         64
Possible type 2      414             72         86
Total                5789

Algorithm details are available at http://www.phekb.org/phenotype/multiple-sclerosis-demonstration-project.
*Possible cases counted as false positives.
†Possible cases counted as true positives.
PPV, positive predictive value.

Table 2 Demographics of all extracted cases

                     No of individuals
Gender
  Female             4484
  Male               1305
Age
  Median             54
  Range              8-107
Deceased             508
Ethnicity
  White              3513
  Black              440
  Asian              11
  Hispanic           16
  Native American    1
  Unknown            1808

Age is calculated for the year 2013 using birth year. Deceased includes individuals
reported deceased in the EMR by linkage to the social security death index.
EMR, electronic medical record.

DISCUSSION 

We identified a large number of individuals with MS and 
detailed clinical information with minimal cost and time 
requirements. Both the MS case algorithms and the algorithms 
to extract detailed MS information performed well, with a pre- 
cision between 87% and 100%. We are unaware of any other 
published dataset of MS patients of this size that has such 
detailed clinical information. This dataset provides a rich 
resource for better understanding MS and also shows that 
extraction of detailed disease states and markers of prognosis in 
patients with chronic disease is possible and may yield a power- 
ful tool in chronic disease research. 

While many studies have identified individuals serving as
cases and controls for disease status from EMR, 1 11 27 28 this is
one of the first studies to focus on specific clinical traits of a
disease by text mining of the EMR. A few other studies have
used text mining approaches to extract blood pressures, pacemaker
implantations, and left ventricular ejection fractions as a
marker of heart failure. 29-31 We have shown that detailed clinical
information valuable to research studies is recorded in
medical records of individuals with MS, and that this information
can be extracted in a highly reliable manner. Such methods
could potentially be applied across multiple EMR, such as envisioned
by the eMERGE network 32 and SHRINE. 33

Table 3 Number of individuals for whom information was
extracted for each clinical trait out of 5789

Clinical trait             Individuals, n
Clinical subtype           3140
Oligoclonal bands          1043
Year of diagnosis          1053
EDSS                       903
Timed 25-foot walk         3523
Year of first symptom      2301
Origin of first symptom    1288
MS medications             2586

EDSS, Expanded Disability Status Scale; MS, multiple sclerosis.

Table 4 Statistics of algorithms compared to blinded manual review of 100 charts for all characteristics

Clinical trait                     Gold standard    Correctly        Recall,  Precision,  Specificity,  F-measure,
                                   positives, n*    identified, n*   %        %           %             %
Clinical MS subtype                61               60               98       88          81            93
Oligoclonal bands                  28               20               71       87          97            78
Year of diagnosis                  51               17               33       89          100           49
Expanded disability status scale   75               61               81       94          100           87
Timed 25-foot walk                 120              99               83       99          100           90
Year of first symptom              56               24               43       100         100           60
Origin of first symptom            62               14               23       88          100           36
MS medications                     99               63               64       95          93            76

*n refers to how many instances were recorded, not number of individuals. For EDSS, clinical subtype, timed 25-foot walk, medications, and origin of first symptom, this could be more
than one per individual.
EDSS, Expanded Disability Status Scale; MS, multiple sclerosis.

We aimed for high precision to create a reliable database of 
information, rather than focusing on high recall, although the 
resulting recall of many algorithms was high. The ability to 
create highly specific algorithms for these clinical traits is due to 
many factors, many attributable to the nature of the disease 
studied. A diagnosis of MS is rarely given if a patient does not 
meet the criteria that are relatively specific to this disease, and 
diagnosis is generally verified by a neurologist. Treatments for 
MS are rarely used in other diseases. VUMC has an MS Center,
with only five clinicians since its opening in 1997. This has 
resulted in a large number of clinic notes focused on the disease 
course of MS for each individual and much less variability in 
the style and content of clinic notes than may be found in other 
disease clinics. It should be noted, however, that not all indivi- 
duals whose records we analyzed were enrolled in the MS 
Center or were even seen by a VUMC neurologist. These 
patients were likely to be seen at VUMC for other reasons and 
treated for MS elsewhere. While the 'possible type 1' algorithm
identified a number of individuals with MS, the majority of 
individuals had not been definitively diagnosed. Depending on 
the purpose of the study, individuals identified by this algorithm 
should be used with caution. 

Laboratory values are easily extractable via EMR, as each 
result is stored under the type of test done. However, the draw- 
back to using EMR-derived data for laboratory values is that if 
the test was not performed at the primary institution (eg, 
VUMC), it will not be reported in a structured field. For
example, the test for oligoclonal bands is most commonly
ordered when trying to make a diagnosis of MS. Indeed, only 
24% of cases had a value for oligoclonal bands in the relevant 
structured fields. Because this is a common test performed when 
diagnosing MS, the result is often echoed in the narrative text. 
We capitalized on clinic note references to extract this informa- 
tion in an additional group of individuals. 

Structured fields in the EMR would also be the most accurate 
way to store and extract non-laboratory data, such as the EDSS 
and timed 25-foot walk measures. Unfortunately, these fields do 
not always contain the desired information due to the nature of 
the data or the EMR, and NLP provides an opportunity to 
recapture these data. We used NLP to extract timed 25-foot
walk scores that were recorded before the existence of the structured
field. Timed 25-foot walk scores derived from structured
fields and NLP methods show no significant difference in our 
dataset (figure 2; see supplementary data, available online only), 
further validating NLP methods as a secondary means of data 
extraction. 

Initially, we used MedEx 34 to extract medications but found 
it challenging to produce a medication list with high recall and 
PPV. To increase the likelihood that a medication mentioned
represents one currently being taken, we required the presence 
of dosage and route information (extracted by MedEx). 
However, the majority of MS medications are given in one dose 
and one type of administration, so this information was often 
missing in the clinic record. Therefore, it was difficult to differentiate,
without further NLP, if a medication was one being
taken or being discussed for another reason. Because of these 
difficulties, we focused on the extraction of medications from 
PL, which contain active lists of medications for each patient. 
By doing this, we gained greater confidence in determining 
which medications a person had been taking. However, PL are 
not always updated, resulting in a lower recall rate than desired. 

The algorithms we have written are not overly intricate, yet
have yielded an extensive amount of clinical data on a large
population. Additional work on these scripts could yield even
greater recall for the clinical traits studied here, and it is likely
that other clinical traits could be extracted. For example, we did
not attempt to extract information about the number, length, or
types of relapses experienced by individuals or start and stop
dates of medications. In future research, we also hope to extract
the reasons why medications are halted: ineffectiveness,
unsustainable side effects, patient non-compliance, etc. The scripts
described in this paper searched for specific references by the
clinician about clinical traits. They did not use the text to infer
information, such as diagnosis year or clinical subtype, both of
which could have been done to enhance recall. In particular, we
had very low recall in our algorithm to extract diagnosis year.
On review of instances of algorithm failure, we often
missed when a patient was diagnosed in the course of the
record, as it is rare that a clinician would record the current
year, instead stating, 'I believe Mr. [NAME] fully meets the criteria
for a diagnosis of MS' or simply listing 'MS' as the final
impression of the clinic visit. Algorithms targeting current diagnoses
would greatly improve the recall of this clinical trait.

Table 5 Statistics of algorithms after additional modifications

Clinical trait             Gold standard    Correctly        Recall,  Precision,  Specificity,  F-measure,
                           positives, n*    identified, n*   %        %           %             %
Timed 25-foot walk         120              108              90       99          100           94
Year of first symptom      56               31               55       97          100           70
Origin of first symptom    62               21               34       88          93            49

*n refers to how many instances were recorded, not number of individuals. For timed 25-foot walk and origin of first symptom, this could be more than one per individual.

Figure 2 Distributions of timed 25-foot walk scores as found in the
structured fields and extracted from the text of the clinical records.
[Plot 'Timed 25 Foot Walk Scores': x axis, seconds; series, structured field and extracted.]

Davis MF, et al. J Am Med Inform Assoc 2013;20:e334-e340. doi:10.1136/amiajnl-2013-001999

The application of the subject selection and clinical trait algo- 
rithms proved to be great tools in the creation of a large dataset 
of MS individuals with longitudinal disease course data at 
VUMC. Further use of these algorithms would be to apply 
them to EMR datasets in other institutions. The subject selec- 
tion algorithms should be easily transferable as there are no 
parts of the algorithm that are specific to VUMC records. The 
transferability of the clinical trait algorithms is likely to vary. We 
expect the most difficult algorithms to transfer would be the age 
and type of first neurological symptom, which rely on clinician- 
specific wording to identify referral letters that contain a history 
with specific key words. The general principle could be carried 
over but evaluation of the clinic notes should be done to evalu- 
ate the format of the notes at the intended university or clinic. 
The presence of oligoclonal bands and timed 25-foot walk algorithms
rely on no institution-specific formats. Ascertainment of
structured fields at any institution should first be attempted; 
however, the ease with which we were able to identify these 
scores suggests NLP-derived algorithms would work well at 
other institutions if needed. Additional methods of detecting
the results in the text could be added if deemed necessary. For
instance, abbreviations for the timed walk, including 'ft' and
'T25FW', were not seen in the records we reviewed but they
may be used at other institutions. We know of no specific 
reasons why the algorithms for age at diagnosis, EDSS, and clin- 
ical subtype would not be transferable. The algorithm for medi- 
cations would depend on the existence of PL at the institution 
of interest. 

CONCLUSIONS 

EMR databases are a rich resource of detailed information of 
the clinical course of MS. This information is extractable from 
clinic notes by simple algorithms, with high specificity, preci- 
sion, and recall. 